## CS6230 : CAD for VLSI

## Project 1: Multiply - Accumulate (MAC) Unit Design

## Team Members :-

- 1. S. Bharath Kumar, CS23M004
- 2. Abinash Chaudhary, CS23M802

 $\underline{MAC}$  is performed on three numbers A, B and C using MAC = A \* B + C. We perform MAC implementation using two selections for different datatypes-

S1 - A(int8), B(int8), C(int32) and MAC(int32) S2 - A(bf16), B(bf16), C(fp32) and MAC(fp32)

<u>Unpipelined Design</u>:- We use shift and add method for multiplication of two numbers and then add to the third input. For addition, we use ripple carry adder technique as we do not use + and \* operators.

In our code, we have named the package name as *mac* which has to be same as filename. We have interface *MACInterface* with predefined methods for A, B, C, and S1\_or\_S2. For selection of S1 or S2, we use bool mode. False(0) for S1 and True(1) for S2. We create a module *mkMAC* to implement the interface and setup registers for storing the inputs A,B,C, selecting S1 or S2 and storing MAC results.

When we select S1 (False) – function Bit#(32) ripple\_add(Bit#(32) x, Bit#(32) y); is a function to implement ripple carry adder without using + operator. We perform AND operation for carry and XOR operation for Sum. The for loop iterates bitwise over x and to do addition with carry. function Bit#(32) shift\_and\_add\_mul(Bit#(8) a, Bit#(8) b); implements multiplication of two 8-bit numbers using shift and add method without using \* operator. This function will shift a to the left and accumulates it to the product whenever the corresponding bit in b is 1. The product will not be accumulated when b is 0. We use these two methods to implement MAC when selecting S1 datatype. method ActionValue #(Bit#(32))  $start_MAC()$ ; does the MAC calculation for the selected mode, here considering S1 mode,  $shift_and_add_mul$  for multiplying A and B and then  $ripple_add$  for adding it to C. method Bit#(32)  $get_MAC_result()$ ; returns the MAC result in 32-bit format ( $mac_result_S1$  for S1 mode).

When we select S2 (True) – For S2multiplication, we first convert bf16 to fp32 format by extending the mantissa without altering the sign and exponent part. Conversion is done using the function  $function\ Bit\#(32)\ bf16\_to\_fp32(Bit\#(16)\ bf16)$ ; After extending bf16 inputs A and B to fp32 format, the 1st bit is sign, next 8 bits for exponent and extend the mantissa to 23 bits by appending zeroes.

We use function Bit#(32)  $multiply\_bf16$  (Bit#(16) a\_bf16, Bit#(16) b\_bf16); to perform multiplication of two bf16 without using \* operator. This multiply function multiplies two bf16 numbers in 32-bit. The product's sign will be the XOR of the signs of A and B. Next, we perform exponent addition by adding the exponents of A and B and then adjusting it with bias (8d'127- 8 bit of 127 decimal). We do the mantissa multiplication using bitwise shift-and-add method. We multiply mant\_a and mant\_b with a leading 1 including which it represents a 24-bitnormalised mantissa. The for loop iterates over each bit of mant\_b and when each time, mant\_b[i]=1, temp\_A is left

shifted and XORed with mant\_product and accumulates the result. The final result of mantissa multiplication will be 48-bit mant\_product. We consider it as normalised, if its highest bit mant\_product[1] is 1(i.e.) mant\_product is right -shifted by 1 bit and the exponent is incremented by 1 bit. Then, the final mantissa is assigned to final\_mant and exp to final\_exp. Then, after multiplication we need to add the result to C, for that we use <code>bitwise\_add</code> to add the multiplication result to 32-bit C. Similar to S1 ripple carry adder, here also we XOR a and b bitwise and a carry is done using AND, OR operations. After obtaining the final sum, we check round bit (22) and extra bit (23), if both are 1, we round up the result by incrementing it by 1. The final result is thus stored in mac result S2 (32-bit format).

<u>Pipelining</u> – Pipelined design splits MAC operation into Multiplication, Rounding and addition stages. Each stage will perfrom a part of calculation and intermediate results are stored It enables new data be entered into the pipeline every clock cycle improving the throughput as many operations can be processed simultaneously at different pipeline stages. Pipelining enables high speed and faster processing by completing every operation in smaller and faster steps.

<u>Pipeline stage 1 (Rule r1 pipe stage1)</u>:-When S1 mode (Integer multiplication and addition) is selected, it performs multiplication (4-bit int values) and accumulation using shift and add and ripple carry adder operation and the result psum is stored in the IntAdderPipeStage struct (which stores the intermediate results for this stage). When S2 mode(bf point multiplication and accumulation) is selected, mac\_float converts bf8 to bf16 and multiply\_bf16 performs multiplication. mac\_float performs addition of floating-point results using bitwise addition. The result psum is stored in FloatAdderPipeStage struct.

<u>Pipeline stage 1 (Rule r1\_pipe\_stage2)</u>:- This rule is for MAC results computation. When S1 mode is selected, mac\_int function is used to multiply the upper 4 bits of val\_1, val\_2 and then accumulate them with the 16 bits of val\_3 and then combine the result psum\_msbs with the previous stage psum\_lsbs result, thus yielding a 32-bit output by combining both results. When S2 mode (fp) is selected, mac\_float is used to combine the rpoduct and sum of the higher bits. Results of both stages are combined to produce a 32-bit floating point result that will be stored in AdderResult.

<u>Verification</u> – The function <code>apply\_inputs</code> takes dut, and inputs a,b,c that are applied to the signals in the MAC module. It assigns values to dut.a, dut.b and dut.c and then waits for a rising edge of the clock dut.clk thus will sync the input and the clock. <code>mac\_random\_test</code> is used for randomized testing of the MAC design. <code>input\_a\_path</code>, <code>input\_b\_path</code>, <code>input\_c\_path</code> and <code>output\_path</code> specifies paths to files containing test vectors and expected results required for our verification of MAC results. Every input file will be read and stored as list of integers (inputs\_a, inputs\_b, inputs\_c) for a, b c in MAC calculation. The expected output values are stored in <code>expected\_outputs</code> for comparison. We can reset the MAC design by setting dut.reset to high (1), holding it for 10ns and then resetting to low (0) and again wait for 10ns. num\_tests will tell the number of randomized tests that needs to be performed.

Test loop – For testing, a random index will be chosen to select the test vectors from inputs\_a, inputs\_b, inputs\_c and expected\_outputs. The function <code>apply\_inputs</code> is called to load inputs to MAC design. Then, after 10ns, the output will be received from dut.output.value.integer . The result with the expected\_output, if it matches, it is passed else error will be thrown.